Using Data Mining Techniques into Real Estates Industry

Author: Madalina-Alina Racovita, 1st year master's student on Computational Optimization at Faculty of Computer Science, UAIC, Iasi, Romania

title

Import dependencies & environment configuration

In [1]:
!pip install researchpy
!pip install pydotplus
!pip install xgboost
!pip install imblearn
Requirement already satisfied: researchpy in c:\tools\anaconda3\lib\site-packages (0.1.9)
Requirement already satisfied: pandas in c:\tools\anaconda3\lib\site-packages (from researchpy) (0.25.2)
Requirement already satisfied: statsmodels in c:\tools\anaconda3\lib\site-packages (from researchpy) (0.11.1)
Requirement already satisfied: scipy in c:\tools\anaconda3\lib\site-packages (from researchpy) (1.2.1)
Requirement already satisfied: numpy in c:\tools\anaconda3\lib\site-packages (from researchpy) (1.18.1)
Requirement already satisfied: python-dateutil>=2.6.1 in c:\tools\anaconda3\lib\site-packages (from pandas->researchpy) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in c:\tools\anaconda3\lib\site-packages (from pandas->researchpy) (2019.3)
Requirement already satisfied: patsy>=0.5 in c:\tools\anaconda3\lib\site-packages (from statsmodels->researchpy) (0.5.1)
Requirement already satisfied: six>=1.5 in c:\tools\anaconda3\lib\site-packages (from python-dateutil>=2.6.1->pandas->researchpy) (1.14.0)
Requirement already satisfied: pydotplus in c:\tools\anaconda3\lib\site-packages (2.0.2)
Requirement already satisfied: pyparsing>=2.0.1 in c:\tools\anaconda3\lib\site-packages (from pydotplus) (2.4.6)
Requirement already satisfied: xgboost in c:\tools\anaconda3\lib\site-packages (0.90)
Requirement already satisfied: numpy in c:\tools\anaconda3\lib\site-packages (from xgboost) (1.18.1)
Requirement already satisfied: scipy in c:\tools\anaconda3\lib\site-packages (from xgboost) (1.2.1)
Requirement already satisfied: imblearn in c:\tools\anaconda3\lib\site-packages (0.0)
Requirement already satisfied: imbalanced-learn in c:\tools\anaconda3\lib\site-packages (from imblearn) (0.6.2)
Requirement already satisfied: scipy>=0.17 in c:\tools\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.2.1)
Requirement already satisfied: joblib>=0.11 in c:\tools\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (0.14.0)
Requirement already satisfied: numpy>=1.11 in c:\tools\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.18.1)
Requirement already satisfied: scikit-learn>=0.22 in c:\tools\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (0.22)
In [2]:
import pandas as pd
import os
import matplotlib
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
import numpy as np
import researchpy as rp
import pydotplus

from scipy.stats import chi2_contingency
from sklearn.tree import export_graphviz
from sklearn.externals.six import StringIO  
from IPython.display import Image  
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from collections import Counter
from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics 
from sklearn.neighbors import KNeighborsClassifier

os.environ['PATH'] = os.environ['PATH'] + ';' + os.environ['CONDA_PREFIX'] + r"\Library\bin\graphviz"

warnings.filterwarnings('ignore')
matplotlib.style.use('ggplot')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', -1)
C:\tools\Anaconda3\lib\site-packages\sklearn\externals\six.py:31: FutureWarning: The module is deprecated in version 0.21 and will be removed in version 0.23 since we've dropped support for Python 2.7. Please rely on the official version of six (https://pypi.org/project/six/).
  "(https://pypi.org/project/six/).", FutureWarning)

Load dataframes

The dataframes are going to be loaded in an unprocessed version since, for this task the feature selection has to be made using empirical and encapsulated manners. It is specified the fact that the features UnitsInBuilding and Stories are going to be removed from the dataframe, the classification task is going to be pursued only of the real estates geographically localized in the Washington state.

In [3]:
os.listdir('./../Data')
Out[3]:
['RCON_12011.assessor.tsv',
 'RCON_53033.assessor.tsv',
 'RSFR_12011.assessor.tsv',
 'RSFR_53033.assessor.tsv']
In [4]:
df_rcon = pd.concat([pd.read_csv("./../Data/RCON_12011.assessor.tsv", sep = "\t"), 
                     pd.read_csv("./../Data/RCON_53033.assessor.tsv", sep = "\t")])
df_rsfr = pd.concat([pd.read_csv("./../Data/RSFR_12011.assessor.tsv", sep = "\t"), 
                     pd.read_csv("./../Data/RSFR_53033.assessor.tsv", sep = "\t")])

df_rcon['isRCON'] = 1
df_rsfr['isRCON'] = 0

df = pd.concat([df_rcon, df_rsfr])
In [5]:
del df['UnitsInBuilding']
del df['Stories']
In [6]:
df = df[df['State'] == 'WA']

Preprocessing steps

Removing constant features

The columns that have a single value in the entire dataframe are going to be dropped since they have no predictive value. Their names were detected during EDA.

In [7]:
columns_constant_features = ['AtticSqft', 'IsFixer', 'GarageNoOfCars', 'EffectiveYearBuilt', 'TotalRooms', 'State']
for current in columns_constant_features:
    del df[current]

Replacing the missing values

Removing columns with missing percentage $\ge$ 0.95

The features that have a missing percentage greater than 95% are going to be dropped, since tey have no predictive value and increase the model complexity with no reason.

In [8]:
column_miss_perc_ge_95 = ['DeedType', 'RoofCode', 'BuildingShapeCode', 'City', 'StructureCode']
for column in column_miss_perc_ge_95:
    del df[column]

Replacing the missing values correspondingly to the feature type

The missing values for the object type columns are going to be replaced with an empty string, the ones for numerical columns with 0, except from SellPrice which is going to be replaced with the mean corresponding to the real estates category from which is belonging: residential or RCON.

In [9]:
features_with_missing_values = ['BuildingCode', 'GarageCarportCode', 'PatioPorchCode', 'PoolCode', 'Zonning', \
                                'PropTaxAmount', 'FoundationCode', 'ExteriorCode', 'CoolingCode', 'HeatingCode', \
                                'HeatingSourceCode', 'View', 'DocType', 'TransType', 'DistressCode', 'SellPrice']
In [10]:
object_miss_values_features = []
numeric_miss_values_features = []
types = df[features_with_missing_values].dtypes

for i in range(len(types)):
    if types[i] == object:
        object_miss_values_features.append(features_with_missing_values[i])
    else:
        numeric_miss_values_features.append(features_with_missing_values[i])

for column in object_miss_values_features:
    df[column] = df[column].fillna('')
    
rcon_medium_sellprice = df_rcon['SellPrice'].mean()
rsfr_medium_sellprice = df_rsfr['SellPrice'].mean()

for column in numeric_miss_values_features:
    if column != 'SellPrice':
        df[column] = df[column].fillna(0)
        
prices = []
for (index, row) in df.iterrows():
    added = False
    if pd.isnull(row['SellPrice']):
        row['SellPrice'] = row['LastSalePrice']
    if row['SellPrice'] == 0:
        if row['isRCON'] == 1:
            prices.append(rcon_medium_sellprice)
        else:
            prices.append(rsfr_medium_sellprice)
    elif row['SellPrice'] != 0:
        prices.append(row['SellPrice'])
        
df['SellPrice'] = prices   

df.loc[(df['isRCON'] == 1) & (df['LastSalePrice'] == 0), 'LastSalePrice'] = rcon_medium_sellprice
df.loc[(df['isRCON'] == 0) & (df['LastSalePrice'] == 0), 'LastSalePrice'] = rsfr_medium_sellprice

Boolean type to numerical

In [11]:
bool_type_columns = df.select_dtypes(include=bool).columns.tolist()
for column in bool_type_columns:
    df[column] = df[column].apply(lambda x: 0 if x == False else 1)
In [12]:
df.head()
Out[12]:
CountyFipsCode BuildingCode StructureNbr LandSqft LivingSqft GarageSqft BasementSqft BasementFinishedSqft Bedrooms TotalBaths FirePlaces YearBuilt Condition ConditionCode Quality QualityCode GarageCarportCode HasPatioPorch PatioPorchCode HasPool PoolCode Zonning LandValue ImprovementValue TotalValue AssessedYear PropTaxAmount Zip Latitude Longitude ConstructionCode FoundationCode ExteriorCode CoolingCode HeatingCode HeatingSourceCode IsWaterfront View ViewScore LastSaleDate LastSalePrice DocType TransType ArmsLengthFlag DistressCode StatusDate SellDate SellPrice OwnerOccupied DistrsdProp isRCON
0 53033 1.0 1 20939 697 0 0 0 1 1.0 1 2004 FAI 2 QAV 6 GB1 0 0.0 0 CB 14099 124899 139000 2016 1696.0 98155 47.770 -122.303 0 0.0 0.0 0.0 0.0 0 12.0 2 2007-10-18 00:00:00 217000.0 W S 1 2012-12-05 17:18:00 2007-10-18 217000.0 1 0 1
1 53033 1.0 1 3999 1440 440 0 0 3 2.0 1 2006 AVE 3 QGO 8 GA0 1 2.0 1 URPSO 207999 221999 430000 2016 5696.0 98053 47.722 -122.030 0 0.0 0.0 0.0 2 2.0 0 0.0 1 2007-10-05 00:00:00 343750.0 W R 1 2012-12-05 21:54:00 2007-10-05 343750.0 0 0 1
2 53033 1.0 1 61240 1030 0 0 0 3 1.0 1 1952 AVE 3 QFA 4 0 0.0 0 RA5 76499 49499 126000 2016 2203.0 98038 47.406 -122.040 0 0.0 0.0 0.0 0.0 0 15.0 3 2006-06-23 00:00:00 275000.0 W R 1 2017-02-17 00:00:00 2006-06-23 275000.0 1 0 1
3 53033 1.0 1 3999 1390 250 410 160 3 3.0 1 2004 AVE 3 QGO 8 GB0 0 0.0 0 LR3 140999 444999 586000 2016 5572.0 98102 47.635 -122.324 0 0.0 0.0 0.0 2 2.0 0 0.0 2 2004-05-20 00:00:00 375000.0 1 2012-12-06 05:29:00 2004-05-20 375000.0 1 0 1
4 53033 1.0 1 189350 1023 0 0 0 2 1.0 1 1978 FAI 2 QAV 6 0 0.0 0 RMA 1.8 52499 103499 156000 2016 1716.0 98034 47.730 -122.241 0 0.0 0.0 0.0 0.0 0 12.0 2 1989-08-07 00:00:00 41000.0 1 2017-02-17 00:00:00 1989-08-07 41000.0 0 0 1

Dropping corner cases observations

The real estates with a price bigger than 2.5 million US dollars are going to be dropped since they are outliers and we want a robust model, which will learn generalities not particularities.

In [13]:
df = df.drop(df[df['SellPrice'] > 2500000].index)

Constructing new features

Some features are going to be rebuilt for increasing their predictive power.

In [14]:
df['SellDate_Year'] = df['SellDate'].apply(lambda x: int(x[:4]))
df['SellDate_Month'] = df['SellDate'].apply(lambda x: int(x[5:7]))
df['SellDate_Day'] = df['SellDate'].apply(lambda x: int(x[8:]))
del df['SellDate']

df['StatusDate'] = df['StatusDate'].apply(lambda x: str(x.split()[0]))
df['SellPricePerLivingSqft'] = df['SellPrice'] / df['LivingSqft']

Features filtration

Before constructing a Machine Learning model it is important to filter the features, since the features uncorrelated with the target class, complicate useless the model and increase the computational complexity. It is desirable a model with as few features as possible, robust and with a great predictive power.

For instance, it can be tolerated a small decrease of acccuracy, if the number of features used in the second model is a lot smaller.

Selecting the most relevant continuous variables

The numerical features are going to be filtered based on the Pearson's correlation coefficient with the target class, isRCON.

In [15]:
corr = df.corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(30, 16))
sns.heatmap(df.corr(), annot=True, fmt=".2f", mask=mask, cmap='magma')
plt.show()

Are going to be kept the features with a correlation coefficient with the isRCON variable greater than 0.08.

In [16]:
relevant_numerical_features = []
for key in corr['isRCON'].keys():
    if abs(corr['isRCON'][key]) > 0.08 and key != 'isRCON':
        relevant_numerical_features.append(key)
relevant_numerical_features
Out[16]:
['BuildingCode',
 'LandSqft',
 'LivingSqft',
 'GarageSqft',
 'BasementSqft',
 'BasementFinishedSqft',
 'Bedrooms',
 'TotalBaths',
 'FirePlaces',
 'YearBuilt',
 'ConditionCode',
 'QualityCode',
 'HasPatioPorch',
 'PatioPorchCode',
 'LandValue',
 'PropTaxAmount',
 'Latitude',
 'Longitude',
 'ConstructionCode',
 'HeatingSourceCode',
 'View',
 'ViewScore',
 'SellPrice',
 'OwnerOccupied',
 'SellDate_Year',
 'SellPricePerLivingSqft']
In [17]:
str(len(corr['isRCON']) - len(relevant_numerical_features)) + ' numerical features were removed due to small correlation from a ' \
+ 'total of ' + str(len(corr['isRCON'])) + ' numerical features'
Out[17]:
'17 numerical features were removed due to small correlation from a total of 43 numerical features'

Let's take a glance on the features that were eliminated.

In [18]:
list(set(corr['isRCON'].keys()).difference(set(relevant_numerical_features))) 
Out[18]:
['isRCON',
 'CountyFipsCode',
 'SellDate_Month',
 'LastSalePrice',
 'ExteriorCode',
 'TotalValue',
 'SellDate_Day',
 'ArmsLengthFlag',
 'ImprovementValue',
 'FoundationCode',
 'HasPool',
 'Zip',
 'AssessedYear',
 'CoolingCode',
 'IsWaterfront',
 'DistrsdProp',
 'StructureNbr']

The categorical features are going to be filterd as well, but before that we have to collect their names.

In [19]:
character_type_features = []
types = df.dtypes

for i in range(len(types)):
    if types[i] == object:
        character_type_features.append(list(df)[i])
In [20]:
len(character_type_features)
Out[20]:
11

Correlation between categorical features

The categorical features are going to be filtered by pursuing some chi-squared tests of independence. The 'so-said' correlation between isRCON and each categorical predictor is going to be measured by the Cramer's V value.

For this task the Chi-square test of independence is going to be used. Chi-square test of independence checks if there is a relationship between two nominal variables. We are considering for instance, two relevant categorical variables: Quality and isRCON, i.e. is going to be tested if the quality of the buildings is higher or lower depending on the real estate type.

The H0 (Null Hypothesis): There is no relationship between Quality and variable isRCON.

The H1 (Alternative Hypothesis): There is a relationship between Quality and isRCON.

If the p-value is significant (as close as possible to 0, preferable smaller than 0.05), you can reject the null hypothesis and claim that the findings support the alternative hypothesis.

In [21]:
contingency_table = pd.crosstab(df['Quality'], df['isRCON'])
contingency_table
Out[21]:
isRCON 0 1
Quality
000 250 411
QAV 23307 8077
QEX 818 89
QFA 6283 194
QGO 22035 2362
QLU 1 0
QPO 120 0
In [22]:
contingency_table, independence_test_results = rp.crosstab(df['Quality'], df['isRCON'], prop='col', test='chi-square')
independence_test_results
Out[22]:
Chi-square test results
0 Pearson Chi-square ( 6.0) = 4445.7996
1 p-value = 0.0000
2 Cramer's V = 0.2637

Cramér's V (sometimes referred to as Cramér's phi and denoted as φc) is a measure of association between two nominal variables.

In [23]:
def cramers_v(confusion_matrix):
    """ calculate Cramers V statistic for categorial-categorial association.
        uses correction from Bergsma and Wicher,
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))

def get_cramers_v_for_given_features(feature1, feature2):
    confusion_matrix = pd.crosstab(df[feature1], df[feature2]).as_matrix()
    return cramers_v(confusion_matrix)

get_cramers_v_for_given_features('Quality', 'isRCON')
Out[23]:
0.26349658266723136

Analysing Cramer's V values between isRCON and the rest of the nominal predictors

There are going to be kept the features with a Cramer's V value greater than 0.2.

In [24]:
relevant_categorical_features = []
for nominal_feature in character_type_features:
    cramers_v_value = get_cramers_v_for_given_features(nominal_feature, 'isRCON')
    if cramers_v_value > 0.2:
        relevant_categorical_features.append(nominal_feature)
    print(nominal_feature + ' & isRCON Cramer\'s V = ' + str(cramers_v_value))    
Condition & isRCON Cramer's V = 0.8632478584421452
Quality & isRCON Cramer's V = 0.26349658266723136
GarageCarportCode & isRCON Cramer's V = 0.7936183580630225
PoolCode & isRCON Cramer's V = 0.05044388413055082
Zonning & isRCON Cramer's V = 0.8724569994304662
HeatingCode & isRCON Cramer's V = 0.9096367955032864
LastSaleDate & isRCON Cramer's V = 0.18589765625016116
DocType & isRCON Cramer's V = 0.19808772815179745
TransType & isRCON Cramer's V = 0.18763570201821633
DistressCode & isRCON Cramer's V = 0.024183050020849497
StatusDate & isRCON Cramer's V = 0.1981435933573486

The features with a Cramer's V value bigger than 0.2 are going to be kept for the further predictive tasks.

In [25]:
relevant_categorical_features
Out[25]:
['Condition', 'Quality', 'GarageCarportCode', 'Zonning', 'HeatingCode']
In [26]:
relevant_features = relevant_numerical_features + relevant_categorical_features
len(relevant_features)
Out[26]:
31

Therefore, after the filtration based on the Pearson's correlation coefficient and on Cramer's V values 31 features were kept as prectors for the ML model.

Data encoding

The dataframe is going to be encoded so that the categorical features are going to be transformed into numerical ones, by a simmilar One-Hot encoding process. The numerical variables are going to be kept unmodified.

In [27]:
encoded_df = pd.get_dummies(df[relevant_features])
encoded_df['isRCON'] = df['isRCON']
In [28]:
len(list(encoded_df))
Out[28]:
479
In [29]:
encoded_df.head(3)
Out[29]:
BuildingCode LandSqft LivingSqft GarageSqft BasementSqft BasementFinishedSqft Bedrooms TotalBaths FirePlaces YearBuilt ConditionCode QualityCode HasPatioPorch PatioPorchCode LandValue PropTaxAmount Latitude Longitude ConstructionCode HeatingSourceCode View ViewScore SellPrice OwnerOccupied SellDate_Year SellPricePerLivingSqft Condition_AVE Condition_FAI Condition_GOO Condition_POO Condition_VGO Quality_000 Quality_QAV Quality_QEX Quality_QFA Quality_QGO Quality_QLU Quality_QPO GarageCarportCode_ GarageCarportCode_C 0 GarageCarportCode_C 1 GarageCarportCode_C 2 GarageCarportCode_G 0 GarageCarportCode_G 1 GarageCarportCode_G 2 GarageCarportCode_G 3 GarageCarportCode_GA0 GarageCarportCode_GA1 GarageCarportCode_GA2 GarageCarportCode_GB0 GarageCarportCode_GB1 GarageCarportCode_GB2 GarageCarportCode_GB3 GarageCarportCode_GB4 GarageCarportCode_GB6 GarageCarportCode_GD0 Zonning_ Zonning_522 Zonning_A10 Zonning_A10P Zonning_A10SO Zonning_A35 Zonning_ABC Zonning_AG Zonning_AI1 Zonning_AI2 Zonning_AP Zonning_B Zonning_BG Zonning_BN Zonning_BO Zonning_BP Zonning_BR2 Zonning_BRMO Zonning_BRR Zonning_C.C. Zonning_C1 Zonning_C130 Zonning_C140 Zonning_C165 Zonning_C2 Zonning_C230 Zonning_C240 Zonning_C2P40 Zonning_C3 Zonning_CA Zonning_CB Zonning_CBC Zonning_CBD Zonning_CBD 1B Zonning_CBD 2 Zonning_CBD 3 Zonning_CBD 4 Zonning_CBD 6 Zonning_CBD 8 Zonning_CBD-R Zonning_CBDB Zonning_CBP Zonning_CBSO Zonning_CC Zonning_CC1 Zonning_CC2 Zonning_CCMU Zonning_CD Zonning_CE Zonning_CF Zonning_CI Zonning_CLI Zonning_CM2 Zonning_CN Zonning_COR Zonning_CR Zonning_CV Zonning_CZ Zonning_DC Zonning_DCE Zonning_DH255 Zonning_DMC 240290400 Zonning_DMC 340290400 Zonning_DMC160 Zonning_DMC240 Zonning_DMC65 Zonning_DMRC 12565 Zonning_DMRC 240125 Zonning_DMRC 656585 Zonning_DMRC 8565 Zonning_DMRR 12565 Zonning_DMRR 24065 Zonning_DMRR 8565 Zonning_DNTNMU Zonning_DNTNO1 Zonning_DNTNO2 Zonning_DNTNOB Zonning_DNTNR Zonning_DOC1 U450U Zonning_DOC2 50030050 Zonning_DR Zonning_DRC 85150 Zonning_DT Zonning_DUC Zonning_EH Zonning_EP Zonning_EP1 Zonning_EP2 Zonning_F Zonning_GC Zonning_GCMU Zonning_GDC Zonning_HC Zonning_HCB Zonning_HDR Zonning_HR Zonning_I Zonning_IB U45 Zonning_IB U65 Zonning_IDM 7585150 Zonning_IDM7585 Zonning_IDR 45125240 Zonning_IG2 U65 Zonning_IG2 U85 Zonning_JBD 2 Zonning_JBD 5 Zonning_JBD 6 Zonning_L1 Zonning_L2 Zonning_L3 Zonning_LDR Zonning_LDT Zonning_LR1 Zonning_LR1 RC Zonning_LR2 Zonning_LR2 RC Zonning_LR3 Zonning_LR3 PUD Zonning_LR3 RC Zonning_M1 Zonning_M1C Zonning_M2 Zonning_MB Zonning_MC Zonning_MDR Zonning_MDR8 Zonning_MF2 Zonning_MF2L Zonning_MF3 Zonning_MFH Zonning_MFM Zonning_MHO Zonning_MIO37LR2 Zonning_MIO37LR3 Zonning_MIO50LR3 Zonning_MIT Zonning_MPD Zonning_MR Zonning_MR85 Zonning_MRD Zonning_MRG Zonning_MRH Zonning_MRM Zonning_MRRC Zonning_MRT12 Zonning_MRT16 Zonning_MSC 1 Zonning_MSC 3 Zonning_MU Zonning_MU12 Zonning_MUO Zonning_MUR Zonning_MUR35 Zonning_MUR45 Zonning_MUR70 Zonning_NB Zonning_NBP Zonning_NC Zonning_NC130 Zonning_NC140 Zonning_NC2-65 Zonning_NC230 Zonning_NC240 Zonning_NC265 Zonning_NC2P30 Zonning_NC2P40 Zonning_NC2P65 Zonning_NC340 Zonning_NC365 Zonning_NC385 Zonning_NC3P40 Zonning_NC3P65 Zonning_NC3P85 Zonning_NCBD Zonning_NRH 3 Zonning_O Zonning_OMU Zonning_OP Zonning_OS2 Zonning_OT Zonning_P Zonning_PBZ Zonning_PLA 15A Zonning_PLA 15B Zonning_PLA 16 Zonning_PLA 17 Zonning_PLA 2 Zonning_PLA 3C Zonning_PLA 5A Zonning_PLA 5B Zonning_PLA 5D Zonning_PLA 5E Zonning_PLA 6A Zonning_PLA 6D Zonning_PLA 6E Zonning_PLA 6F Zonning_PLA 6I Zonning_PLA 6J Zonning_PLA 6K Zonning_PLA 7B Zonning_PMM85 Zonning_PO Zonning_POSPF Zonning_PR Zonning_PR 2.4 Zonning_PRC Zonning_PRC1 Zonning_PRR Zonning_PSM 100100120 Zonning_PSM 100100130 Zonning_PSM 100120150 Zonning_PSM100 Zonning_PSM85120 Zonning_PUD Zonning_R Zonning_R 2800 Zonning_R 2800, OP Zonning_R 2800, OP, L Zonning_R 4000 Zonning_R 40000 Zonning_R 5400A, OP Zonning_R 5400D Zonning_R 7200 Zonning_R 8400 Zonning_R 9600 Zonning_R 9600, SSHO Zonning_R-14 Zonning_R-6 Zonning_R-8 Zonning_R1 Zonning_R1.8 Zonning_R10 Zonning_R12 Zonning_R12.5 Zonning_R12P Zonning_R12PSO Zonning_R12SO Zonning_R14 Zonning_R15 Zonning_R16 Zonning_R18 Zonning_R18P Zonning_R18PSO Zonning_R18SO Zonning_R1P Zonning_R1SO Zonning_R2 Zonning_R2.5 Zonning_R20 Zonning_R20A Zonning_R24 Zonning_R24SO Zonning_R3 Zonning_R3.5 Zonning_R30 Zonning_R4 Zonning_R4.5 Zonning_R40 Zonning_R48 Zonning_R48SO Zonning_R4C Zonning_R4P Zonning_R4PSO Zonning_R4SO Zonning_R5 Zonning_R6 Zonning_R6C Zonning_R6P Zonning_R6PSO Zonning_R6SO Zonning_R7 Zonning_R7.2 Zonning_R7.5 Zonning_R8 Zonning_R8.4 Zonning_R8P Zonning_R8PSO Zonning_R8SO Zonning_R9.6 Zonning_RA10 Zonning_RA10DPA Zonning_RA10PSO Zonning_RA10SO Zonning_RA2.5 Zonning_RA2.5P Zonning_RA2.5SO Zonning_RA3600 Zonning_RA5 Zonning_RA5P Zonning_RA5SO Zonning_RC Zonning_RCM Zonning_RH 8 Zonning_RIN SINGLE F Zonning_RL Zonning_RM Zonning_RM 1.8 Zonning_RM 1800 Zonning_RM 2.4 Zonning_RM 2400 Zonning_RM 3.6 Zonning_RM 3600 Zonning_RM 5 Zonning_RM 5.0 (1) Zonning_RM 900 Zonning_RM12 Zonning_RM18 Zonning_RM1800 Zonning_RM24 Zonning_RM2400 Zonning_RM3600 Zonning_RM48 Zonning_RM900 Zonning_RM900A Zonning_RMA 1.8 Zonning_RMA 2.4 Zonning_RMA 3.6 Zonning_RMA 5 Zonning_RMF Zonning_RMH Zonning_RML Zonning_RMU Zonning_RO Zonning_RS Zonning_RS 10000 Zonning_RS 11 Zonning_RS 12.5 Zonning_RS 15000 Zonning_RS 20000 Zonning_RS 35 Zonning_RS 5 Zonning_RS 6 Zonning_RS 6.3 Zonning_RS 7.2 Zonning_RS 7200 Zonning_RS 8.5 Zonning_RS 9600 Zonning_RS12000 Zonning_RS15 Zonning_RS15000 Zonning_RS35 Zonning_RS4000 Zonning_RS5 Zonning_RS7.2 Zonning_RS7200 Zonning_RS8400 Zonning_RS9.6 Zonning_RS9600 Zonning_RSA 1 Zonning_RSA 4 Zonning_RSA 6 Zonning_RSA 8 Zonning_RSE Zonning_RSLTC Zonning_RSR Zonning_RSX 35 Zonning_RSX 5 Zonning_RSX 7.2 Zonning_RSX 8.5 Zonning_SF 5000 Zonning_SF 5000PUD Zonning_SF 7200 Zonning_SF 9600 Zonning_SFD Zonning_SFE Zonning_SFR 10 Zonning_SFS Zonning_SFSL Zonning_SGC Zonning_SM65 Zonning_SM85 Zonning_SR1 Zonning_SR3 Zonning_SR30 Zonning_SR4.5 Zonning_SR6 Zonning_SR8 Zonning_SVV Zonning_T Zonning_TC Zonning_TC A1 Zonning_TC B Zonning_TC C Zonning_TC D Zonning_TC3 Zonning_TC4 Zonning_TL 8 Zonning_TSQ Zonning_UH1800 Zonning_UH900 Zonning_UHUCR Zonning_UL15000 Zonning_UL7200 Zonning_UL9600 Zonning_UM2400 Zonning_UM3600 Zonning_UNCL Zonning_UR Zonning_URP Zonning_URPSO Zonning_URSO Zonning_US R1 Zonning_UT1 Zonning_UV Zonning_UVEV Zonning_WD I Zonning_WD II Zonning_WD III HeatingCode_ HeatingCode_1 HeatingCode_11 HeatingCode_12 HeatingCode_13 HeatingCode_17 HeatingCode_18 HeatingCode_2 HeatingCode_20 HeatingCode_3 HeatingCode_4 HeatingCode_5 HeatingCode_9 HeatingCode_Y isRCON
0 1.0 20939 697 0 0 0 1 1.0 1 2004 2 6 0 0.0 14099 1696.0 47.770 -122.303 0 0.0 12.0 2 217000.0 1 2007 311.334290 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
1 1.0 3999 1440 440 0 0 3 2.0 1 2006 3 8 1 2.0 207999 5696.0 47.722 -122.030 0 2.0 0.0 1 343750.0 0 2007 238.715278 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
2 1.0 61240 1030 0 0 0 3 1.0 1 1952 3 4 0 0.0 76499 2203.0 47.406 -122.040 0 0.0 15.0 3 275000.0 1 2006 266.990291 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1

Since the number of features from the encoded dataframe is quite big and this will increase the complexity of the computations for the ML models, I am going to filter as well these features, by removing the ones with a Pearson's correlation coefficient smaller than 0.08, analogically with the previous filtration step on the initial dataframe.

In [30]:
corr = encoded_df.corr()
relevant_features_from_encoded_df = []
for key in corr['isRCON'].keys():
    if abs(corr['isRCON'][key]) > 0.08 and key != 'isRCON':
        relevant_features_from_encoded_df.append(key)
In [31]:
len(relevant_features_from_encoded_df)
Out[31]:
75

From a total of 479 variables, 75 are going to be kept for model building.

In [32]:
X = encoded_df[relevant_features_from_encoded_df]
y = df['isRCON']

Imbalanced target class

Imbalanced datasets are those where there is a severe skew in the class distribution. This problem can be solved with resampling strategies. Resampling involves creating a new transformed version of the training dataset in which the selected examples have a different class distribution.

In [33]:
df['isRCON'].value_counts()
Out[33]:
0    52814
1    11133
Name: isRCON, dtype: int64

Since the number of residential properties is bigger than the number of RCON real estates, it is possible to have some problems caused by the unbalanced dataset. I am going to build a prototype classification model to see how the imbalanced class affects the performance. As performance metrics, there are going to be analized the confusion matrix, the precision, recall and accuracy values. The confusion matrix that has the following structure: title

${\displaystyle {\textbf{Precision}}={\frac {tp}{tp+fp}}\,}$

${\displaystyle {\textbf{Recall}}={\frac {tp}{tp+fn}}\,}$

In [34]:
def get_performance_results_for_model(model, y_test, y_pred):
    print("Accuracy:", metrics.accuracy_score(y_test, y_pred), '\n\n')
    
    print('_________________ Confusion Matrix __________________')
    conf_matrix = confusion_matrix(y_test, y_pred)
    
    fig = plt.figure(figsize=(7,2))
    sns.heatmap(conf_matrix.T, annot=True, fmt='d', 
                cmap='BuPu', cbar=False,
                xticklabels=list(set(y_test)),
                yticklabels=list(set(y_test)))
    plt.xlabel('true label')
    plt.ylabel('predicted label')
    plt.show()
    
    print('_______________ Classification Report _______________\n\n', 
          classification_report(y_test, y_pred))
In [35]:
from sklearn.svm import LinearSVC

def build_clf_model(X_train_df, X_test, y_train_df, y_test, random_state):
    clf = LinearSVC(random_state=random_state)
    clf.fit(X_train_df, y_train_df)
    pred = cross_val_predict(clf, X_test, y_test, cv=5)
    
    print('_____________________   Classifier  _______________________\n')
    get_performance_results_for_model(clf, y_test, pred)
    return clf
In [36]:
# train-test split: test dataframe will have 30% from the entire dataframe
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) 

clf_model_imbalanced_df = build_clf_model(X_train, X_test, y_train, y_test, 0)
_____________________   Classifier  _______________________

Accuracy: 0.8931978107896794 


_________________ Confusion Matrix __________________
_______________ Classification Report _______________

               precision    recall  f1-score   support

           0       0.95      0.92      0.93     15737
           1       0.68      0.76      0.72      3448

    accuracy                           0.89     19185
   macro avg       0.81      0.84      0.83     19185
weighted avg       0.90      0.89      0.90     19185

As it can be seen the model overlearns the majority clas, and the minority class is not predicted that well due to smaller precision and recall metrics.

Random oversampling duplicates examples from the minority class in the training dataset. Downside: overfitting for some models.

Random undersampling deletes examples from the majority class. Downside: losing information invaluable to a model.

Since the number of residential estates is bigger and represents 5/6 from the entire dataset, unsampling won't be chosen as a strategy for imbalanced class problem.

In [37]:
from imblearn.over_sampling import RandomOverSampler
print('Before oversampling: ', Counter(y_train))

oversample = RandomOverSampler(sampling_strategy='minority')
X_train_over, y_train_over = oversample.fit_resample(X_train, y_train)
print('After oversampling:', Counter(y_train_over))
Using TensorFlow backend.
Before oversampling:  Counter({0: 37077, 1: 7685})
After oversampling: Counter({0: 37077, 1: 37077})
In [38]:
X_train = X_train_over
y_train = y_train_over

clf_model_after_oversampling = build_clf_model(X_train, X_test, y_train, y_test, 42)
_____________________   Classifier  _______________________

Accuracy: 0.7574146468595256 


_________________ Confusion Matrix __________________
_______________ Classification Report _______________

               precision    recall  f1-score   support

           0       0.95      0.74      0.83     15737
           1       0.41      0.84      0.55      3448

    accuracy                           0.76     19185
   macro avg       0.68      0.79      0.69     19185
weighted avg       0.86      0.76      0.78     19185

After oversampling it can be seen that the global accuracy was decreased, but this won't represent an impediment since we are interested in building a model with good predicatbility for both classes, and it seems that the second model predicts better the minority class.

Train-test split and 5-fold cross-validation preparing

title

Stratified cross-validation

Stratification is the process of rearranging the data so as to ensure that each fold is a good representative of the whole. For example, in a binary classification problem where each class comprises of 50% of the data, it is best to arrange the data such that in every fold, each class comprises of about half the instances.

It is generally a better approach when dealing with both bias and variance. A randomly selected fold might not adequately represent the minor class, particularly in cases where there is a huge class imbalance.

title

In [39]:
skf = StratifiedKFold(n_splits=5, random_state=None)
fold_index = 1

for train_index, val_index in skf.split(X,y): 
    X_train, X_test = X.iloc[train_index], X.iloc[val_index] 
    y_train, y_test = y.iloc[train_index], y.iloc[val_index]
    
    print('Train ' + str(fold_index) + 'th fold y distribution: ', Counter(y_train))
    print('Test ' + str(fold_index) + 'th fold y distribution: ', Counter(y_test))
    print('___________________________________________________________')
    
    fold_index += 1
Train 1th fold y distribution:  Counter({0: 42251, 1: 8906})
Test 1th fold y distribution:  Counter({0: 10563, 1: 2227})
___________________________________________________________
Train 2th fold y distribution:  Counter({0: 42251, 1: 8906})
Test 2th fold y distribution:  Counter({0: 10563, 1: 2227})
___________________________________________________________
Train 3th fold y distribution:  Counter({0: 42252, 1: 8906})
Test 3th fold y distribution:  Counter({0: 10562, 1: 2227})
___________________________________________________________
Train 4th fold y distribution:  Counter({0: 42251, 1: 8907})
Test 4th fold y distribution:  Counter({0: 10563, 1: 2226})
___________________________________________________________
Train 5th fold y distribution:  Counter({0: 42251, 1: 8907})
Test 5th fold y distribution:  Counter({0: 10563, 1: 2226})
___________________________________________________________

As it can be noticed each fold has the same distribution for the target class due to the stratification. The StratifiedKFold can be performed as well using the function cross_val_predict from sklearn.model_selection with the parameter cv = integer, to specify the number of folds in a (Stratified)KFold.

Decission Trees classification

In [40]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
X_train, y_train = oversample.fit_resample(X_train, y_train) # oversampling for train
In [41]:
print('Training Features Shape:', X_train.shape)
print('Training Labels Shape:', y_train.shape)
print('Testing Features Shape:', X_test.shape)
print('Testing Labels Shape:', y_test.shape)
Training Features Shape: (74154, 75)
Training Labels Shape: (74154,)
Testing Features Shape: (19185, 75)
Testing Labels Shape: (19185,)
In [42]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)
y_pred = cross_val_predict(clf, X_test, y_test, cv=5)

get_performance_results_for_model(clf, y_test, y_pred)
Accuracy: 0.9916080271045087 


_________________ Confusion Matrix __________________
_______________ Classification Report _______________

               precision    recall  f1-score   support

           0       1.00      0.99      0.99     15737
           1       0.97      0.98      0.98      3448

    accuracy                           0.99     19185
   macro avg       0.98      0.99      0.99     19185
weighted avg       0.99      0.99      0.99     19185

In [43]:
dot_data = StringIO()
export_graphviz(clf, out_file = dot_data,  
                filled=True, rounded = True,
                special_characters = True, feature_names = relevant_features_from_encoded_df,
                class_names=['0','1'])

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png('real_estates_dt.png')
Image(graph.create_png())
Out[43]:

It is known that decision-tree learners can create over-complex trees that do not generalise the data well (i.e. produce overfitting). Since the accuracy was 99.1% by using unpruned decission trees, I am going to built a model with max-depth 4 to see which are the most important features. For performance criterion I am going to use the entropy, instead of gini index.

Gini impurity measures the probability of a particular variable being wrongly classified when it is randomly chosen.

Information Gain is used to determine which feature/attribute gives us the maximum information about a class. It is based on the concept of entropy, which is the degree of disorder. It aims to reduce the level of entropy starting from the root node to the leave nodes.

Recap formulas

$ \textbf{Gini}(var) = 1 - \sum\limits_{i=1}^n {p_i}^2$, where $p_i$ is the probability of an object being classified to a particular class. The attribute/feature with the least Gini index is chosen as the root node.

$ \textbf{Entropy}(var) = \sum\limits_{i=1}^n -p_i \log_2{p_i}$

In [44]:
clf = DecisionTreeClassifier(criterion="entropy", max_depth=4)
clf = clf.fit(X_train,y_train)
y_pred = cross_val_predict(clf, X_test, y_test, cv=5)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.9853531404743289

As it can be seen the accuracy is still a very good one, 98.5%. Let's take a look on the structure of the decission tree classifier.

In [45]:
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True, feature_names = relevant_features_from_encoded_df,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png('real_estates_dt_maxdepth3.png')

Image(graph.create_png())
Out[45]:

It can be seen that the missing value for HeatingCode has the biggest predictive power. It is followed by features like LandSqft, BuildingCode, YearBuilt, Zonning_SF, HeatingSourceCode, LandValue and View.

Feature selection using decission trees

There are going to be constructured a decision tree classifier for each feature. The features corresponding to the models with best performance results are going to be selected.

In [46]:
def get_accuracy_for_dt_classifier_based_on_df(X_given, y_given):
    X_train, X_test, y_train, y_test = train_test_split(X_given, y_given, test_size=0.4, random_state=1)
    
    clf = DecisionTreeClassifier()
    clf = clf.fit(X_train.values.reshape(-1,1),y_train)
    y_pred = clf.predict(X_test.values.reshape(-1,1))
    
    return metrics.accuracy_score(y_test, y_pred)

accuracies = [get_accuracy_for_dt_classifier_based_on_df(X[feature], y) for feature in list(X)]
indexes_best_accuracies = sorted(range(len(accuracies)), key=lambda i: accuracies[i], reverse=True)

There are going to be kept the predictors that have a bit bigger predictive value.

In [47]:
features_best_predictive_value_dtc = []

for index_feature in indexes_best_accuracies[:50]:
    features_best_predictive_value_dtc.append(list(X)[index_feature])
        
len(features_best_predictive_value_dtc)
Out[47]:
50
In [48]:
print('Predictors with biggest predictive value: \n\n', features_best_predictive_value_dtc[:10])
Predictors with biggest predictive value: 

 ['HeatingCode_', 'HeatingSourceCode', 'ConditionCode', 'Condition_FAI', 'LandValue', 'View', 'LivingSqft', 'LandSqft', 'ViewScore', 'BuildingCode']
In [49]:
print('Predictors that are going to be eliminated after decission tree selection: \n\n', 
      list(set(list(X)).difference(set(features_best_predictive_value_dtc))))
Predictors that are going to be eliminated after decission tree selection: 

 ['Zonning_R5', 'GarageCarportCode_GB0', 'BasementSqft', 'Zonning_SF 5000', 'FirePlaces', 'Condition_AVE', 'GarageSqft', 'SellPrice', 'HasPatioPorch', 'SellDate_Year', 'OwnerOccupied', 'Zonning_R4', 'Quality_QGO', 'GarageCarportCode_GA0', 'Condition_VGO', 'SellPricePerLivingSqft', 'Zonning_SF 7200', 'Quality_QAV', 'Condition_GOO', 'Zonning_RA5', 'Quality_QFA', 'Zonning_R6', 'YearBuilt', 'PatioPorchCode', 'HeatingCode_3']
In [50]:
X = X[features_best_predictive_value_dtc]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

K Nearest Neighbours

title

Finding the best k

In [51]:
error_rate = []

for i in range(1, 30):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    y_pred = cross_val_predict(knn, X_test, y_test, cv=5)
    error_rate.append(np.mean(y_pred != y_test))
In [52]:
plt.figure(figsize=(15,6))

plt.plot(range(1,30), error_rate, color='green', 
         linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=4)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
Out[52]:
Text(0, 0.5, 'Error Rate')
In [53]:
best_k = np.argmin(error_rate) + 1
best_k
Out[53]:
2
In [54]:
worst_k = np.argmax(error_rate) + 1
worst_k
Out[54]:
1

Comparison between tunned k and worst k

In [55]:
knn = KNeighborsClassifier(n_neighbors = best_k)
knn.fit(X_train,y_train)
y_pred = cross_val_predict(knn, X_test, y_test, cv=5)
get_performance_results_for_model(knn, y_test, y_pred)
Accuracy: 0.9456194534579146 


_________________ Confusion Matrix __________________
_______________ Classification Report _______________

               precision    recall  f1-score   support

           0       0.95      0.99      0.97     21002
           1       0.93      0.75      0.83      4577

    accuracy                           0.95     25579
   macro avg       0.94      0.87      0.90     25579
weighted avg       0.94      0.95      0.94     25579

In [56]:
knn = KNeighborsClassifier(n_neighbors = worst_k)
knn.fit(X_train,y_train)
y_pred = cross_val_predict(knn, X_test, y_test, cv=5)

get_performance_results_for_model(knn, y_test, y_pred)
Accuracy: 0.939051565737519 


_________________ Confusion Matrix __________________
_______________ Classification Report _______________

               precision    recall  f1-score   support

           0       0.96      0.96      0.96     21002
           1       0.83      0.83      0.83      4577

    accuracy                           0.94     25579
   macro avg       0.90      0.89      0.90     25579
weighted avg       0.94      0.94      0.94     25579

The differences are notable, the accuracy for 2-NN is bigger, i.e. 94.5% comparing to 93.9% as well as the confusion matrix result long with presission and recall for both classes.

Gaussian Naive Bayes

title

Recap formulas

Bayes formula with the presumption of mutual independence between features: $P(Y | X_1, X_2, ...., X_n) = \frac{P(Y) \prod\limits_{i=1}^n P(X_i | Y)}{P(X_1, X_2, ...., X_n)} $

Bayes Naive tries to find: $\widehat{Y} = argmax_y P(Y) \prod\limits_{i=1}^n P(X_i | Y)$

The Gaussian Naive Bayes was chosen due to the continuous nature of some features. This type of NB assumes that the likelihood of the features is Gaussian: $P(X_i | Y) = \frac{1}{\sqrt{2\pi {\sigma(Y)}^2}} exp(- \frac{(X_i-\mu(Y))^2}{2\sigma(Y)}^2) $

In [57]:
from sklearn.naive_bayes import GaussianNB

gaussian_nb = GaussianNB()
gaussian_nb.fit(X_train, y_train)
y_pred = cross_val_predict(gaussian_nb, X_test, y_test, cv=5)

print('______________   Gaussian Naive Bayes _______________\n')
get_performance_results_for_model(gaussian_nb, y_test, y_pred)
______________   Gaussian Naive Bayes _______________

Accuracy: 0.8818953047421713 


_________________ Confusion Matrix __________________
_______________ Classification Report _______________

               precision    recall  f1-score   support

           0       0.88      0.99      0.93     21002
           1       0.89      0.39      0.54      4577

    accuracy                           0.88     25579
   macro avg       0.88      0.69      0.74     25579
weighted avg       0.88      0.88      0.86     25579

Neural Networks

title

In [58]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(solver='lbfgs', alpha=1e-4,
                    hidden_layer_sizes=(100, 2), 
                    random_state=1)
clf.fit(X_train, y_train)
y_pred = cross_val_predict(clf, X_test, y_test)
get_performance_results_for_model(clf, y_test, y_pred)
Accuracy: 0.17893584581101685 


_________________ Confusion Matrix __________________
_______________ Classification Report _______________

               precision    recall  f1-score   support

           0       0.00      0.00      0.00     21002
           1       0.18      1.00      0.30      4577

    accuracy                           0.18     25579
   macro avg       0.09      0.50      0.15     25579
weighted avg       0.03      0.18      0.05     25579

As it can be seen the neural network performs awful, since it classifies every instance as being RCON.

I am going to modify the configuration of the Neural Network, scaling the training date in a previous step.

In [59]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
scaled_data_train = sc.fit_transform(X_train)
scaled_data_test = sc.fit_transform(X_test)

clf = MLPClassifier(solver='adam', activation='tanh', 
                    alpha=1e-4, #regularization parameter for L2
                    hidden_layer_sizes=(100, 2), 
                    early_stopping=True,
                    random_state=1)

clf.fit(scaled_data_train, y_train)
y_pred = cross_val_predict(clf, scaled_data_test, y_test)
get_performance_results_for_model(clf, y_test, y_pred)
Accuracy: 0.9869815082685015 


_________________ Confusion Matrix __________________
_______________ Classification Report _______________

               precision    recall  f1-score   support

           0       0.99      1.00      0.99     21002
           1       0.98      0.95      0.96      4577

    accuracy                           0.99     25579
   macro avg       0.98      0.97      0.98     25579
weighted avg       0.99      0.99      0.99     25579

The improvements are more than clear. I used 'tanh' due to the problems that logistic function have (vanishing gradients, gradient equat to 0 on $(-\infty, -4) and (4, +\infty) due to asimptotical nature of sigmoid function). I didn't used relu due to the problem of exploding gradients that can arise. I used ADAM (Adaptive Momentum) as optimiser since it is based on ADAGRAD (Adaptive gradient) and RMSPROP (Root Mean Square Propagation) optimisers.

I am going to try the same configuration on a neural network with more than 1 hidden layer to see how it performs. I would decrease the number of neurons from a layer to another to form that bottle neck effect.

In [60]:
len(list(X))
Out[60]:
50

The input layer will be formed of 50 neurons.

In [61]:
clf = MLPClassifier(solver='adam', activation='tanh', 
                    alpha=1e-4, #regularization parameter for L2
                    hidden_layer_sizes=(150, 50, 10, 2), 
                    early_stopping=True,
                    random_state=1)

clf.fit(scaled_data_train, y_train)
y_pred = cross_val_predict(clf, scaled_data_test, y_test)
get_performance_results_for_model(clf, y_test, y_pred)
Accuracy: 0.9895617498729427 


_________________ Confusion Matrix __________________
_______________ Classification Report _______________

               precision    recall  f1-score   support

           0       0.99      1.00      0.99     21002
           1       0.98      0.96      0.97      4577

    accuracy                           0.99     25579
   macro avg       0.99      0.98      0.98     25579
weighted avg       0.99      0.99      0.99     25579

The differences between accuracies are not that notable, more exactly 3e-03. The complex architecture performs better but since the difference between accuracies is so small the simpler model will be chosen.

Support Vector Machines

title

The implementation for SVM from sklearn is based on libsvm. The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples. For large datasets it is sugested to be used sklearn.svm.LinearSVC or _sklearn.linearmodel.SGDClassifier instead, possibly after a _sklearn.kernelapproximation.Nystroem transformer.

LinearSVC is similar to SVC(Support Vector Classifier) with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.

Linear SVC with Nystroem kernel aproximation

The Nystroem method is a general method for low-rank approximations of kernels. It achieves this by essentially subsampling the data on which the kernel is evaluated. By default Nystroem uses the rbf kernel, but it can use any kernel function or a precomputed kernel matrix.

In [62]:
from sklearn.kernel_approximation import Nystroem
from sklearn.svm import LinearSVC

kernel_types = ['linear', 'poly', 'rbf', 'sigmoid']

def build_linersvc_with_nystroem_approx(kernel_type, verbose=False):
    clf = LinearSVC()
    feature_map_nystroem = Nystroem(kernel=kernel_type, 
                                    gamma=.4,
                                    random_state=1, 
                                    n_components=200)
    data_train_transformed = feature_map_nystroem.fit_transform(X_train)
    clf.fit(data_train_transformed, y_train)
    
    data_test_transformed = feature_map_nystroem.fit_transform(X_test)
    y_pred = cross_val_predict(clf, data_test_transformed, y_test)
    print('Accuracy for Linear SVC with ' + kernel_type + ' kernel: ', metrics.accuracy_score(y_test, y_pred))
    
    if verbose == True:
        get_performance_results_for_model(clf, y_test, y_pred)
    return clf
In [63]:
for kernel_type in kernel_types:
    build_linersvc_with_nystroem_approx(kernel_type, verbose=False)
Accuracy for Linear SVC with linear kernel:  0.9041792095077994
Accuracy for Linear SVC with poly kernel:  0.8193048985495914
Accuracy for Linear SVC with rbf kernel:  0.8216114781656828
Accuracy for Linear SVC with sigmoid kernel:  0.8210641541889832

Since the best kernel was the linear one, I am going to build a model but only using LinearSVC, without kernel approximation.

Linear SVC without kernel approximation

The defalt loss function for LinearSVC is squared_hinge, the penalty term is L2 (penalty with euclidian norm), and dual=True, which will imply that the algorithm will solve the dual optimisation problem. Let's see how the algorithm performs with the default configuration.

In [64]:
clf = LinearSVC(random_state=0)
clf.fit(X_train, y_train)
y_pred = cross_val_predict(clf, X_test, y_test)
get_performance_results_for_model(clf, y_pred, y_test)
Accuracy: 0.9143437976465069 


_________________ Confusion Matrix __________________
_______________ Classification Report _______________

               precision    recall  f1-score   support

           0       0.99      0.91      0.95     22899
           1       0.55      0.95      0.70      2680

    accuracy                           0.91     25579
   macro avg       0.77      0.93      0.82     25579
weighted avg       0.95      0.91      0.92     25579

I am going to change the penalty term to L1, due to the sparsity among feature values, and I will set dual to False since the number_of_features > n_samples. I will set the tolerance to a smaller value, defalt is 1e-4, I will set it to 1e-7.

In [65]:
clf = LinearSVC(random_state=0, penalty='l1', dual=False, tol=1e-7)
clf.fit(X_train, y_train)
y_pred = cross_val_predict(clf, X_test, y_test)
get_performance_results_for_model(clf, y_pred, y_test)
Accuracy: 0.985104968919817 


_________________ Confusion Matrix __________________
_______________ Classification Report _______________

               precision    recall  f1-score   support

           0       1.00      0.99      0.99     21211
           1       0.94      0.98      0.96      4368

    accuracy                           0.99     25579
   macro avg       0.97      0.98      0.97     25579
weighted avg       0.99      0.99      0.99     25579

The improvements are more than obvious, so that with the new configuration the global accuracy reached the value of 97.5%.

Feature Selection + SVC

In [66]:
from sklearn.feature_selection import SelectFromModel #to select the features with non-zero coeficients

clf = LinearSVC(random_state=0, penalty='l1', dual=False, tol=1e-7)
clf.fit(X_train, y_train)
model = SelectFromModel(clf, prefit=True)
X_feature_selected = model.transform(X)
In [67]:
feature_idx = model.get_support()
feature_names = X.columns[feature_idx]
len(feature_names)
Out[67]:
44
In [68]:
clf = LinearSVC(random_state=0, penalty='l1', dual=False, tol=1e-7)
clf.fit(X_train[feature_names], y_train)
y_pred = cross_val_predict(clf, X_test[feature_names], y_test)
get_performance_results_for_model(clf, y_pred, y_test)
Accuracy: 0.9847531177919387 


_________________ Confusion Matrix __________________
_______________ Classification Report _______________

               precision    recall  f1-score   support

           0       1.00      0.99      0.99     21198
           1       0.94      0.98      0.96      4381

    accuracy                           0.98     25579
   macro avg       0.97      0.98      0.97     25579
weighted avg       0.99      0.98      0.98     25579

Ensemble learning

title

Algorithms that use bagging techniques

title

Random Forest

In [69]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators = 10, random_state = 42)
rf.fit(X_train, y_train);

y_pred = rf.predict(X_test)
y_pred = cross_val_predict(rf, X_test, y_test, cv=5)

get_performance_results_for_model(rf, y_test, y_pred)
Accuracy: 0.9944094765237108 


_________________ Confusion Matrix __________________
_______________ Classification Report _______________

               precision    recall  f1-score   support

           0       0.99      1.00      1.00     21002
           1       1.00      0.97      0.98      4577

    accuracy                           0.99     25579
   macro avg       0.99      0.99      0.99     25579
weighted avg       0.99      0.99      0.99     25579

Features importance
In [70]:
feature_imp = pd.Series(rf.feature_importances_, index=list(X)).sort_values(ascending=False)
feature_imp[:10]
Out[70]:
HeatingCode_         0.234232
HeatingSourceCode    0.175162
ConditionCode        0.109851
View                 0.091835
LandValue            0.085925
Condition_FAI        0.077286
PropTaxAmount        0.044005
Bedrooms             0.036257
LandSqft             0.024873
Latitude             0.015182
dtype: float64
In [71]:
fig = plt.figure(figsize=(13,13))
sns.barplot(x=feature_imp, y=feature_imp.index)

plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()

There are going to be selected the features which are most important in order to build a simplified classification model.

In [72]:
most_important_features = []
for key in feature_imp.keys():
    if feature_imp[key] > 0.01:
        most_important_features.append(key)
len(most_important_features)
Out[72]:
15
In [73]:
rf_most_important = RandomForestClassifier(n_estimators = 10, random_state = 42)
rf_most_important.fit(X_train[most_important_features], y_train);

y_pred = cross_val_predict(rf_most_important, X_test[most_important_features], y_test, cv=5)
errors = abs(y_pred - y_test)

print('Mean Absolute Error:', round(np.mean(errors), 6))
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Mean Absolute Error: 0.007741
Accuracy: 0.9922592751866766

The difference between accuracies are not that consistent, i.e. 1e-03, the second model having an accuracy of 99.6% and the first one 99.7%. Therefore I am going to use the simplified model for the tunning stage in which we will search for the proper number of estimators.

Visualizing a single estimator
In [74]:
tree = rf_most_important.estimators_[0]
dot_data = StringIO()

export_graphviz(tree, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True, 
                feature_names = most_important_features, 
                class_names=['0','1'])

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png('random_forest_first_estimator.png')

Image(graph.create_png())
Out[74]:
In [75]:
rf_small = RandomForestClassifier(n_estimators=10, max_depth = 3)
rf_small.fit(X_train[most_important_features], y_train)

y_pred = cross_val_predict(rf_small, X_test[most_important_features], y_test, cv=5)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

tree = rf_small.estimators_[0]
dot_data = StringIO()

export_graphviz(tree, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True, 
                feature_names = most_important_features, 
                class_names=['0','1'])

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png('random_forest_first_estimator_prunned.png')

Image(graph.create_png())
Accuracy: 0.9747058133625239
Out[75]:

Since the decission trees can produce overfitting, I am going to tune not only the number of the estimators for the random forest algorithm but the tree's max depth as well.

In [76]:
import tqdm 

def get_acc_rf_classifier_given_n(n_estimators, max_depth=None):
    rf = RandomForestClassifier(n_estimators = n_estimators, max_depth=max_depth, random_state = 42)
    rf.fit(X_train[most_important_features], y_train);

    y_pred = cross_val_predict(rf, X_test[most_important_features], y_test, cv=5)
    return metrics.accuracy_score(y_test, y_pred)
    
accuracies_tunned_estimators_no = []
for i in tqdm.tqdm(range(50)):
    accuracies_tunned_estimators_no.append(get_acc_rf_classifier_given_n(i + 1))

best_n = np.argmax(accuracies_tunned_estimators_no) + 1

print("Accuracy RF with " + str(best_n) + ' is: ', accuracies_tunned_estimators_no[best_n-1])
100%|██████████████████████████████████████████████████████████████████████████████████| 50/50 [03:17<00:00,  3.95s/it]
Accuracy RF with 45 is:  0.9937448688377184

The result is not that surprising since the random forest is a greedy algorithm so it will have better performance with the increasing of the estimators.

In [77]:
accuracies_tunned_estimators_no[0]
Out[77]:
0.9849094960709958

As is can be seen even with a single estimator the random forest results are quite good, this is because the trees chosen overfit the data and they have to be prunned.

In [78]:
fig = plt.figure(figsize=(7,5))
plt.title('The evolution of the accuracy basen on the number of the estimators')
plt.plot(accuracies_tunned_estimators_no)
plt.xlabel('n_estimators')
plt.ylabel('global accuracy')
plt.show()
In [79]:
accuracies_tunned_depth = []
for depth in range(10):
    accuracies_tunned_depth.append(get_acc_rf_classifier_given_n(n_estimators=5, max_depth=depth + 1))
accuracies_tunned_depth
Out[79]:
[0.9573478243871926,
 0.96989718128152,
 0.9730247468626608,
 0.9769342038390868,
 0.9790062160365925,
 0.9835802806990109,
 0.9855741037569882,
 0.9865123734313304,
 0.988701669338129,
 0.9894835607334141]
In [80]:
fig = plt.figure(figsize=(7,5))
plt.title('The evolution of the accuracy basen on the decissions trees depth')
plt.plot(accuracies_tunned_depth)
plt.xlabel('max_depth')
plt.ylabel('global accuracy')
plt.show()

As the accuracy results show, the max_depth choice is also a greedy one. For an accuracy of 98.6% a random forest with 5 estimators with a max_depth of 9 will be more than enough.

In [81]:
rf = RandomForestClassifier(n_estimators = 5, max_depth=9, random_state = 42)
rf.fit(X_train, y_train);

y_pred = rf.predict(X_test)
y_pred = cross_val_predict(rf, X_test, y_test, cv=5)

get_performance_results_for_model(rf, y_test, y_pred)
Accuracy: 0.9865123734313304 


_________________ Confusion Matrix __________________
_______________ Classification Report _______________

               precision    recall  f1-score   support

           0       0.99      1.00      0.99     21002
           1       0.99      0.93      0.96      4577

    accuracy                           0.99     25579
   macro avg       0.99      0.97      0.98     25579
weighted avg       0.99      0.99      0.99     25579

Extratrees Classifier

This class implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

In [82]:
from sklearn.ensemble import ExtraTreesClassifier

extc = ExtraTreesClassifier(n_estimators=10, min_samples_split=3)      
extc.fit(X_train, y_train) 

y_pred = cross_val_predict(extc, X_test, y_test, cv=5)
get_performance_results_for_model(extc, y_test, y_pred)
Accuracy: 0.9923765588959693 


_________________ Confusion Matrix __________________
_______________ Classification Report _______________

               precision    recall  f1-score   support

           0       0.99      1.00      1.00     21002
           1       1.00      0.96      0.98      4577

    accuracy                           0.99     25579
   macro avg       0.99      0.98      0.99     25579
weighted avg       0.99      0.99      0.99     25579

In [83]:
def get_acc_et_classifier_given_n(n_estimators, max_depth=None):
    et = ExtraTreesClassifier(n_estimators = n_estimators, max_depth=max_depth, random_state = 42)
    et.fit(X_train[most_important_features], y_train);

    y_pred = cross_val_predict(et, X_test[most_important_features], y_test, cv=5)
    return metrics.accuracy_score(y_test, y_pred)
    
accuracies_tunned_estimators_no = []
for i in tqdm.tqdm(range(30)):
    accuracies_tunned_estimators_no.append(get_acc_et_classifier_given_n(i + 1))

best_n = np.argmax(accuracies_tunned_estimators_no) + 1

print("Accuracy ET with " + str(best_n) + ' is: ', accuracies_tunned_estimators_no[best_n-1])
100%|██████████████████████████████████████████████████████████████████████████████████| 30/30 [00:53<00:00,  1.79s/it]
Accuracy ET with 27 is:  0.9931584502912545

In [84]:
fig = plt.figure(figsize=(7,5))
plt.title('The evolution of the accuracy basen on the number of the estimators')
plt.plot(accuracies_tunned_estimators_no)
plt.xlabel('n_estimators')
plt.ylabel('global accuracy')
plt.show()

Therefore the best number of estimators is 27, even though if we are tolerant simpler models return approximatly the same performance results.

In [89]:
extc = ExtraTreesClassifier(n_estimators=27, min_samples_split=3)      
extc.fit(X_train, y_train) 

y_pred = cross_val_predict(extc, X_test, y_test, cv=5)
get_performance_results_for_model(extc, y_test, y_pred)
Accuracy: 0.9928847883029047 


_________________ Confusion Matrix __________________
_______________ Classification Report _______________

               precision    recall  f1-score   support

           0       0.99      1.00      1.00     21002
           1       1.00      0.97      0.98      4577

    accuracy                           0.99     25579
   macro avg       0.99      0.98      0.99     25579
weighted avg       0.99      0.99      0.99     25579

Algorithms that use boosting techniques

Extreme boosting: XGBoost

In [86]:
from xgboost import XGBClassifier

xgb_model = XGBClassifier()
xgb_model.fit(X_train, y_train)
y_pred = cross_val_predict(xgb_model, X_test, y_test, cv=5)

get_performance_results_for_model(xgb_model, y_test, y_pred)
Accuracy: 0.9929629774424332 


_________________ Confusion Matrix __________________
_______________ Classification Report _______________

               precision    recall  f1-score   support

           0       0.99      1.00      1.00     21002
           1       0.99      0.97      0.98      4577

    accuracy                           0.99     25579
   macro avg       0.99      0.99      0.99     25579
weighted avg       0.99      0.99      0.99     25579

Stacking

In [87]:
from mlxtend.classifier import StackingClassifier
from sklearn.linear_model import LogisticRegression

clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(n_estimators=5, max_depth=9, random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], meta_classifier=lr)

for clf, label in zip([clf1, clf2, clf3, sclf], ['KNN', 'Random Forest','Naive Bayes','StackingClassifier']):
    clf.fit(X_train, y_train)
    y_pred = cross_val_predict(clf, X_test, y_test, cv=5)
    print('Accuracy ' + label, metrics.accuracy_score(y_test, y_pred))
Accuracy KNN 0.939051565737519
Accuracy Random Forest 0.9875679268149654
Accuracy Naive Bayes 0.8818953047421713
Accuracy StackingClassifier 0.939051565737519
In [91]:
y_pred[:10]
Out[91]:
array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int64)

Decission boundaries visualisation (PCA as first step)

In [92]:
from mlxtend.plotting import plot_decision_regions
from sklearn.decomposition import PCA
import matplotlib.gridspec as gridspec
import itertools

gs = gridspec.GridSpec(2, 2)

fig = plt.figure(figsize=(15,15))

value=1.5
width=0.75

pca = PCA(n_components = 2)
X_train_after_pca = pca.fit_transform(X_train)

for clf, lab, grd in zip([clf1, clf2, clf3, sclf], ['KNN', 'Random Forest', 'Naive Bayes','StackingClassifier'],
                          itertools.product([0, 1], repeat=2)):

    clf.fit(X_train_after_pca, y_train)
    y_pred = cross_val_predict(clf, X_test, y_test, cv=5)
    print('Accuracy ' + lab, metrics.accuracy_score(y_test, y_pred))
    
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X_train_after_pca, y=y_train.values, 
                                clf=clf)
    plt.title(lab)
Accuracy KNN 0.939051565737519
Accuracy Random Forest 0.9875679268149654
Accuracy Naive Bayes 0.8818953047421713
Accuracy StackingClassifier 0.939051565737519

Overview on the constructed models. Which is the best one?

In [93]:
pd.DataFrame({'Classifier':['Decission Tree Classifier (without prunning)', 
                             'Decission Tree Classifier max_depth=4',
                             '2-NN', 'Gaussian Naive Bayes',
                             'MLP Classsifier', 'Linear SVC',
                             'Random Forest', 'Extratrees', 'XGBoost', 'Stacked Classifier'],
             'Accuracy': [0.991608, 0.9853531, 0.9456194, 0.881895, 0.986981, 
                          0.9851049, 0.9944094, 0.992376, 0.9929629, 0.93905156]}).sort_values(by=['Accuracy'], 
                                                                                  ascending=False)
Out[93]:
Classifier Accuracy
6 Random Forest 0.994409
8 XGBoost 0.992963
7 Extratrees 0.992376
0 Decission Tree Classifier (without prunning) 0.991608
4 MLP Classsifier 0.986981
1 Decission Tree Classifier max_depth=4 0.985353
5 Linear SVC 0.985105
2 2-NN 0.945619
9 Stacked Classifier 0.939052
3 Gaussian Naive Bayes 0.881895

References